[Diffusion] [AMD] Online MXFP4 and FP8 Quantization for Multimodal Generation by ColinZ22 · Pull Request #21431 · sgl-project/sglang

ColinZ22 · 2026-03-25T23:11:37Z

Motivation

Adds online MXFP4 (for AMD GPUs) and FP8 quantization for multimodal (image and video) generation with models such as Z-Image-Turbo and Wan 2.2.

Modifications

New --quantization server argument that loads an unquantized model and quantizes its weights and activations to MXFP4.
New --quantization-ignored-layers server argument that skips selected layers during online quantization, keeping them in full precision.
New Mxfp4Config and Mxfp4LinearMethod classes that use AITER dynamic MXFP4 quantization and MXFP4 GEMM kernels.
FP8 online quantization is enabled through the same --quantization argument.
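
For readers unfamiliar with the format, the following is a minimal pure-Python sketch of MXFP4-style block quantization as described by the OCP Microscaling (MX) formats: blocks of 32 values share one power-of-two scale, and each element is stored as FP4 E2M1. It is not the AITER kernel used by Mxfp4LinearMethod; the function names are hypothetical, and the tie-breaking in the rounding step is a simplification chosen for brevity.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # number of elements that share one scale in the MX formats
E2M1_EMAX = 2    # exponent of the largest E2M1 power of two (4.0 == 2**2)


def quantize_block(block):
    """Quantize one block; returns (shared_exponent, quantized_elements).

    Hypothetical helper for illustration only.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    shared_exp = (math.frexp(amax)[1] - 1) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))  # range of the E8M0 scale
    scale = 2.0 ** shared_exp
    quantized = []
    for x in block:
        # Saturate at the largest E2M1 magnitude, then pick the nearest value.
        # Ties resolve to the smaller magnitude here (a simplification).
        mag = min(abs(x) / scale, E2M1_VALUES[-1])
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        quantized.append(math.copysign(nearest, x))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Reconstruct approximate values from the shared exponent and FP4 elements."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


if __name__ == "__main__":
    block = [0.1 * i - 1.5 for i in range(BLOCK_SIZE)]
    exp, q = quantize_block(block)
    restored = dequantize_block(exp, q)
    print("shared exponent:", exp)
    print("max abs error:", max(abs(a - b) for a, b in zip(block, restored)))
```

Because every element needs only 4 bits plus one shared 8-bit scale per 32 elements, the storage cost is roughly 4.25 bits per value, which is the source of the memory savings this kind of format offers.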

Usage Example

To quantize a diffusion model online to FP8 or MXFP4, add the --quantization argument:

sglang generate \
  --model-path Tongyi-MAI/Z-Image-Turbo \
  --prompt "A beautiful sunset over the mountains" \
  --save-output \
  --quantization fp8

sglang generate \
  --model-path Tongyi-MAI/Z-Image-Turbo \
  --prompt "A beautiful sunset over the mountains" \
  --save-output \
  --quantization mxfp4

Generation Quality Comparison

Prompt 1: "A cat sitting at the top of a mountain looking down at a futuristic city"

FP16	FP8	MXFP4

Prompt 2: "A crowd of people of various age at a busy outdoor marketplace"

FP16	FP8	MXFP4

Prompt 3: "A young child blowing dandelion seeds, golden hour lighting"

FP16	FP8	MXFP4

Prompt 4: "A city street at sunset with snow-capped mountain in the distant background"

FP16	FP8	MXFP4

Performance Benchmarking

Model: Z-Image-Turbo
Dataset: 200 images from HuggingFace Parti-Prompts

Online Quant Method	Transformer Size (GB)	Peak Mem Size (GB)	Total Gen Time (sec)	Denoise Time (sec)	Avg CLIP Score (↑)
bf16 (baseline)	11.46	19.00	201.06	132.54	32.20
fp8	5.86 (-49%)	13.42 (-29%)	191.91 (-5%)	131.74 (-1%)	32.31
mxfp4	3.23 (-72%)	10.77 (-43%)	165.05 (-18%)	104.78 (-21%)	32.22
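
The percentages in the table can be reproduced from the raw numbers. The short sketch below recomputes each relative change against the bf16 baseline; the dictionary keys and function names are illustrative choices, not identifiers from the PR.

```python
# Raw values copied from the benchmark table above.
BASELINE = {
    "transformer_gb": 11.46,
    "peak_mem_gb": 19.00,
    "total_gen_sec": 201.06,
    "denoise_sec": 132.54,
}
RESULTS = {
    "fp8": {
        "transformer_gb": 5.86,
        "peak_mem_gb": 13.42,
        "total_gen_sec": 191.91,
        "denoise_sec": 131.74,
    },
    "mxfp4": {
        "transformer_gb": 3.23,
        "peak_mem_gb": 10.77,
        "total_gen_sec": 165.05,
        "denoise_sec": 104.78,
    },
}


def percent_change(value, baseline):
    """Relative change versus the baseline, rounded to a whole percent."""
    return round((value - baseline) / baseline * 100)


def relative_changes(results, baseline):
    """Map each method to its per-metric percent change versus the baseline."""
    return {
        method: {k: percent_change(v, baseline[k]) for k, v in metrics.items()}
        for method, metrics in results.items()
    }


if __name__ == "__main__":
    for method, changes in relative_changes(RESULTS, BASELINE).items():
        print(method, changes)
```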

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.


mickqian

could you also mention this new server arg in cli.md, quantization.md or other related places?

ColinZ22 · 2026-04-07T23:44:34Z

> could you also mention this new server arg in cli.md, quantization.md or other related places?

Added documentation in cli.md and quantization.md


BowenBao

LGTM.

@mickqian , @HaiShaw , @avjves This PR covers a superset of the functionality in #23373. Would it make sense to consolidate the effort and land this one instead? We’re happy to rebase if #23373 lands first, though it seems unnecessary from our perspective.

avjves · 2026-04-22T18:44:56Z

> LGTM.
>
> @mickqian , @HaiShaw , @avjves This PR covers a superset of the functionality in #23373. Would it make sense to consolidate the effort and land this one instead? We’re happy to rebase if #23373 lands first, though it seems unnecessary from our perspective.

Definitely, I'm happy either way as long as the functionality lands! I originally didn't notice this PR before I had already created a new one.

mickqian · 2026-04-25T03:29:15Z


+## Online Quantization
+
+Online quantization applies quantization to unquantized models at load time. This is useful for when pre-quantized checkpoints are not available.


nit: add (on-the-fly / load-time quantization) as well

mickqian · 2026-04-25T03:29:56Z

/tag-and-rerun-ci

avjves · 2026-05-11T07:28:28Z

@ColinZ22 PR #20922 was merged, which adds the initial support for online quantization, including FP8 quantization. It's missing MXFP4 quantization still though. Are you planning on updating this PR to match the current state to get it merged? :)

ColinZ22 · 2026-05-11T17:24:14Z

> @ColinZ22 PR #20922 was merged, which adds the initial support for online quantization, including FP8 quantization. It's missing MXFP4 quantization still though. Are you planning on updating this PR to match the current state to get it merged? :)

@avjves Updated, thanks for letting me know!

BowenBao · 2026-05-12T21:17:47Z

@ColinZ22 please fix lint checks

ColinZ22 · 2026-05-13T17:29:16Z

Fixed, @mickqian @wisclmy0611 Re-review would be greatly appreciated! Hoping to land this PR soon.

BowenBao · 2026-05-13T20:44:26Z

@amd-bot ci-status

ColinZ22 · 2026-05-13T20:46:40Z

@amd-bot ci-status

HaiShaw · 2026-05-23T07:31:11Z

@ColinZ22 FP8 path is broken on main - FYI.
cc @yichiche @yctseng0211

yichiche · 2026-05-25T07:35:37Z

@HaiShaw Here is the PR to re-activate fp8 aiter backend. #26261

yichiche · 2026-05-25T08:55:24Z

@ColinZ22 Currently we see an error when we use --enable-torch-compile true together with --quantization fp8. Any suggestions?

ColinZ22 · 2026-05-26T15:24:38Z

Hi @yichiche, could you try adding @torch.compiler.disable to Fp8LinearMethod.apply()? This creates a graph break, which avoids the torch compile crash caused by Inductor being unable to lower aten._scaled_mm, the op used in the FP8 GEMM.

It should fix the error and still allow torch compile to bring significant performance improvements. (The online MXFP4 quantization path needs a separate fix to enable torch compile; SGLang Diffusion disables torch compile for all paths by default because of issues like this.)


ColinZ22 and others added 3 commits March 25, 2026 21:52

Add online MXFP4 and FP8 Quantization Support

f810e6d

Merge branch 'sgl-project:main' into mxfp4

a7e10a2

formatting

11f217c

github-actions Bot added the diffusion SGLang Diffusion label Mar 25, 2026

ColinZ22 changed the title ~~Online MXFP4 and FP8 Quantization for Multimodal Generation~~ [Diffusion] Online MXFP4 and FP8 Quantization for Multimodal Generation Mar 26, 2026

BowenBao reviewed Mar 26, 2026


Comment thread python/sglang/multimodal_gen/runtime/loader/fsdp_load.py Outdated

Comment thread python/sglang/multimodal_gen/runtime/server_args.py Outdated

Comment thread python/sglang/multimodal_gen/runtime/layers/quantization/mxfp4.py Outdated

ColinZ22 and others added 3 commits March 26, 2026 16:34

fixed comment and dup

d9ea94a

Update server_args.py

a25f5d9

Co-authored-by: Bowen Bao <bowenbao@amd.com>

comment updatee

78eb43f

ColinZ22 mentioned this pull request Mar 31, 2026

[RFC] AMD Quark Quantized Diffusion Image&Video Model Deployment in SGLang #21769

Open

ColinZ22 marked this pull request as ready for review April 1, 2026 19:37

ColinZ22 requested review from BBuf, mickqian, ping1jing2, yhyang201 and yingluosanqian as code owners April 1, 2026 19:37

mickqian reviewed Apr 2, 2026


Comment thread python/sglang/multimodal_gen/runtime/layers/quantization/mxfp4.py

Comment thread python/sglang/multimodal_gen/runtime/server_args.py

ColinZ22 added 2 commits April 7, 2026 22:59

merge sglang/main

1ae5c8e

merge from main and add docs

86d5d9a

github-actions Bot added documentation Improvements or additions to documentation quant LLM Quantization labels Apr 7, 2026

ColinZ22 changed the title ~~[Diffusion] Online MXFP4 and FP8 Quantization for Multimodal Generation~~ [ROCM][Diffusion] Online MXFP4 and FP8 Quantization for Multimodal Generation Apr 16, 2026

ColinZ22 changed the title ~~[ROCM][Diffusion] Online MXFP4 and FP8 Quantization for Multimodal Generation~~ [ROCM] [Diffusion] Online MXFP4 and FP8 Quantization for Multimodal Generation Apr 16, 2026

ColinZ22 changed the title ~~[ROCM] [Diffusion] Online MXFP4 and FP8 Quantization for Multimodal Generation~~ [Diffusion] [ROCM] Online MXFP4 and FP8 Quantization for Multimodal Generation Apr 16, 2026

ColinZ22 changed the title ~~[Diffusion] [ROCM] Online MXFP4 and FP8 Quantization for Multimodal Generation~~ [Diffusion] [AMD] Online MXFP4 and FP8 Quantization for Multimodal Generation Apr 16, 2026

add quantization-ignored-layers, packed modules mapping and quant_config fix for zimage, and mxfp4 perf improvements

72c846a

ColinZ22 requested a review from DarkSharpness as a code owner April 16, 2026 22:06

BowenBao approved these changes Apr 22, 2026


ColinZ22 requested a review from mickqian April 24, 2026 18:29

mickqian approved these changes Apr 25, 2026


Merge branch 'main' into mxfp4

634aa7c

mickqian requested a review from wisclmy0611 as a code owner April 25, 2026 03:30

github-actions Bot added the run-ci label Apr 25, 2026

HaiShaw added 3 commits April 26, 2026 21:38

Merge branch 'main' into mxfp4

d98c28e

nit: fix lint

eea40f8

lint

3cb38ba

ColinZ22 added 2 commits May 11, 2026 17:16

Merged Main

3380af5

online quantization doc updates

266a480

lint fix

a7d1e8a

Merge branch 'main' into mxfp4

0bf5600

mickqian merged commit 34c0029 into sgl-project:main May 14, 2026
119 of 145 checks passed

HaiShaw assigned yichiche and yctseng0211 May 23, 2026

ColinZ22 mentioned this pull request May 26, 2026

[Fix] Fix FP8 Online Quantization #26415

Open

Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026

[diffusion] [AMD] feat: support online MXFP4 and fp8 quantization (sgl-project#21431) Co-authored-by: Bowen Bao <bowenbao@amd.com> Co-authored-by: HAI <hixiao@gmail.com>

01051cc


